assignment step
Nested Mini-Batch K-Means
James Newling, François Fleuret
A new algorithm is proposed which accelerates the mini-batch k-means algorithm of Sculley (2010) by using the distance bounding approach of Elkan (2003). We argue that, when incorporating distance bounds into a mini-batch algorithm, already used data should preferentially be reused. To this end we propose using nested mini-batches, whereby data in a mini-batch at iteration t is automatically reused at iteration t+1. Using nested mini-batches presents two difficulties. The first is that unbalanced use of data can bias estimates, which we resolve by ensuring that each data sample contributes exactly once to centroids. The second is in choosing mini-batch sizes, which we address by balancing premature fine-tuning of centroids with redundancy induced slow-down. Experiments show that the resulting nmbatch algorithm is very effective, often arriving within 1% of the empirical minimum 100 earlier than the standard mini-batch algorithm.
Fixed-sized clusters $k$-Means
Malinen, Mikko I., Fränti, Pasi
We present a $k$-means-based clustering algorithm, which optimizes the mean square error, for given cluster sizes. A straightforward application is balanced clustering, where the sizes of each cluster are equal. In the $k$-means assignment phase, the algorithm solves an assignment problem using the Hungarian algorithm. This makes the assignment phase time complexity $O(n^3)$. This enables clustering of datasets of size more than 5000 points.
Nested Mini-Batch K-Means
A new algorithm is proposed which accelerates the mini-batch k-means algorithm of Sculley (2010) by using the distance bounding approach of Elkan (2003). We argue that, when incorporating distance bounds into a mini-batch algorithm, already used data should preferentially be reused. To this end we propose using nested mini-batches, whereby data in a mini-batch at iteration t is automatically reused at iteration t + 1. Using nested mini-batches presents two difficulties. The first is that unbalanced use of data can bias estimates, which we resolve by ensuring that each data sample contributes exactly once to centroids. The second is in choosing mini-batch sizes, which we address by balancing premature fine-tuning of centroids with redundancy induced slow-down. Experiments show that the resulting nmbatch algorithm is very effective, often arriving within 1% of the empirical minimum 100 earlier than the standard mini-batch algorithm.
Memetic Differential Evolution Methods for Semi-Supervised Clustering
Mansueto, Pierluigi, Schoen, Fabio
In this paper, we deal with semi-supervised Minimum Sum-of-Squares Clustering (MSSC) problems where background knowledge is given in the form of instance-level constraints. In particular, we take into account "must-link" and "cannot-link" constraints, each of which indicates if two dataset points should be associated to the same or to a different cluster. The presence of such constraints makes the problem at least as hard as its unsupervised version: it is no more true that each point is associated to its nearest cluster center, thus requiring some modifications in crucial operations, such as the assignment step. In this scenario, we propose a novel memetic strategy based on the Differential Evolution paradigm, directly extending a state-of-the-art framework recently proposed in the unsupervised clustering literature. As far as we know, our contribution represents the first attempt to define a memetic methodology designed to generate a (hopefully) optimal feasible solution for the semi-supervised MSSC problem. The proposal is compared with some state-of-the-art algorithms from the literature on a set of well-known datasets, highlighting its effectiveness and efficiency in finding good quality clustering solutions.
Wasserstein k-means with sparse simplex projection
Fukunaga, Takumi, Kasai, Hiroyuki
This paper presents a proposal of a faster Wasserstein $k$-means algorithm for histogram data by reducing Wasserstein distance computations and exploiting sparse simplex projection. We shrink data samples, centroids, and the ground cost matrix, which leads to considerable reduction of the computations used to solve optimal transport problems without loss of clustering quality. Furthermore, we dynamically reduced the computational complexity by removing lower-valued data samples and harnessing sparse simplex projection while keeping the degradation of clustering quality lower. We designate this proposed algorithm as sparse simplex projection based Wasserstein $k$-means, or SSPW $k$-means. Numerical evaluations conducted with comparison to results obtained using Wasserstein $k$-means algorithm demonstrate the effectiveness of the proposed SSPW $k$-means for real-world datasets
Fast K-Means Clustering with Anderson Acceleration
Zhang, Juyong, Yao, Yuxin, Peng, Yue, Yu, Hao, Deng, Bailin
We propose a novel method to accelerate Lloyd's algorithm for K-Means clustering. Unlike previous acceleration approaches that reduce computational cost per iterations or improve initialization, our approach is focused on reducing the number of iterations required for convergence. This is achieved by treating the assignment step and the update step of Lloyd's algorithm as a fixed-point iteration, and applying Anderson acceleration, a well-established technique for accelerating fixed-point solvers. Classical Anderson acceleration utilizes m previous iterates to find an accelerated iterate, and its performance on K-Means clustering can be sensitive to choice of m and the distribution of samples. We propose a new strategy to dynamically adjust the value of m, which achieves robust and consistent speedups across different problem instances. Our method complements existing acceleration techniques, and can be combined with them to achieve state-of-the-art performance. We perform extensive experiments to evaluate the performance of the proposed method, where it outperforms other algorithms in 106 out of 120 test cases, and the mean decrease ratio of computational time is more than 33%.
FLIC: Fast Linear Iterative Clustering With Active Search
Zhao, Jiaxing (Nankai University) | Ren, Bo (Nankai University ) | Hou, Qibin (Nankai University ) | Cheng, Ming-Ming (Nankai University ) | Rosin, Paul (Cardiff University)
In this paper, we reconsider the clustering problem for image over-segmentation from a new perspective. We propose a novel search algorithm named “active search” which explicitly considers neighboring continuity. Based on this search method, we design a back-and-forth traversal strategy and a "joint" assignment and update step to speed up the algorithm. Compared to earlier works, such as Simple Linear Iterative Clustering (SLIC) and its follow-ups, who use fixed search regions and perform the assignment and the update step separately, our novel scheme reduces the iteration number before convergence, as well as improves boundary sensitivity of the over-segmentation results. Extensive evaluations on the Berkeley segmentation benchmark verify that our method outperforms competing methods under various evaluation metrics. In particular, lowest time cost is reported among existing methods (approximately 30 fps for a 481321 image on a single CPU core). To facilitate the development of over-segmentation, the code will be publicly available.
Nested Mini-Batch K-Means
Newling, James, Fleuret, François
A new algorithm is proposed which accelerates the mini-batch k-means algorithm of Sculley (2010) by using the distance bounding approach of Elkan (2003). We argue that, when incorporating distance bounds into a mini-batch algorithm, already used data should preferentially be reused. To this end we propose using nested mini-batches, whereby data in a mini-batch at iteration t is automatically reused at iteration t+1. Using nested mini-batches presents two difficulties. The first is that unbalanced use of data can bias estimates, which we resolve by ensuring that each data sample contributes exactly once to centroids. The second is in choosing mini-batch sizes, which we address by balancing premature fine-tuning of centroids with redundancy induced slow-down. Experiments show that the resulting nmbatch algorithm is very effective, often arriving within 1\% of the empirical minimum 100 times earlier than the standard mini-batch algorithm.
Nested Mini-Batch K-Means
Newling, James, Fleuret, François
A new algorithm is proposed which accelerates the mini-batch k-means algorithm of Sculley (2010) by using the distance bounding approach of Elkan (2003). We argue that, when incorporating distance bounds into a mini-batch algorithm, already used data should preferentially be reused. To this end we propose using nested mini-batches, whereby data in a mini-batch at iteration t is automatically reused at iteration t+1. Using nested mini-batches presents two difficulties. The first is that unbalanced use of data can bias estimates, which we resolve by ensuring that each data sample contributes exactly once to centroids. The second is in choosing mini-batch sizes, which we address by balancing premature fine-tuning of centroids with redundancy induced slow-down. Experiments show that the resulting nmbatch algorithm is very effective, often arriving within 1% of the empirical minimum 100 times earlier than the standard mini-batch algorithm.
Fast K-Means with Accurate Bounds
Newling, James, Fleuret, François
We propose a novel accelerated exact k-means algorithm, which performs better than the current state-of-the-art low-dimensional algorithm in 18 of 22 experiments, running up to 3 times faster. We also propose a general improvement of existing state-of-the-art accelerated exact k-means algorithms through better estimates of the distance bounds used to reduce the number of distance calculations, and get a speedup in 36 of 44 experiments, up to 1.8 times faster. We have conducted experiments with our own implementations of existing methods to ensure homogeneous evaluation of performance, and we show that our implementations perform as well or better than existing available implementations. Finally, we propose simplified variants of standard approaches and show that they are faster than their fully-fledged counterparts in 59 of 62 experiments.